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Abstract 

Background: Short tandem repeats are ubiquitous in genomic sequences and due to their complex evolutionary 
history pose a challenge for sequence alignment tools. 

Results: To better account for the presence of tandem repeats in pairwise sequence alignments, we propose a 
simple tractable pair hidden Markov model that explicitly models their presence. Using the framework of gain 
functions, we design several optimization criteria for decoding this model and describe resulting decoding 
algorithms, ranging from the traditional Viterbi and posterior decoding to block-based decoding algorithms tailored 
to our model. We compare the accuracy of individual decoding algorithms on simulated and real data and find that 
our approach is superior to the classical three-state pair HMM. 

Conclusions: Our study illustrates versatility of pair hidden Markov models coupled with appropriate decoding 
criteria as a modeling tool for capturing complex sequence features. 
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Background 

In this paper, we explore the use of pair hidden Markov 
models (pair HMMs, PHMMs) in improving the quality 
of pairwise sequence alignment in the presence of tandem 
repeats. We propose a simple tractable model that explic- 
itly accounts for short tandem repeats, and we use the 
framework of maximum expected gain to explore a variety 
of decoding optimization criteria for our model 

Pair HMMs have for a long time played a major role 
in sequence alignment [1]. The traditional Needleman- 
Wunsch algorithm [2] and its variants can be easily for- 
mulated as a special case of alignment with PHMMs (we 
call this approach Viterbi decoding). The main advantage 
of PHMMs is that they allow us to express the scoring 
scheme in a principled way in the context of a probabilistic 
model. 

Sequence alignments are a mainstay of comparative 
genomics. By comparing sequences that evolved from a 
common ancestor, we can infer their phylogenetic rela- 
tionships, discover sites under functional constraint, or 
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even shed light on the function of individual sequence ele- 
ments. However, comparative genomic methods are very 
sensitive to the quality of underlying alignments, and even 
slight inaccuracies may lead to artifacts in the results of 
comparative methods. 

It is very difficult to evaluate alignment accuracy, yet 
even simple statistics can reveal artifacts of present-day 
algorithms. Lunter et al. [3] demonstrated systematic 
biases caused by the optimization criteria set by the 
Needleman-Wunsch approach. They show that by using 
variants of the posterior decoding instead of the tradi- 
tional Viterbi algorithm, one can significantly increase 
the alignment quality. The advantage of the posterior 
decoding is that it summarizes information from all align- 
ments of the two sequences, while the Viterbi decoding 
seeks only one highest scoring alignment. The posterior 
decoding was also found superior by other authors [4-6]. 

An algorithm by Hudek [7] is an intermediate between 
the Viterbi and posterior decoding, summarizing prob- 
abilities of alignments within short blocks. The goal is 
to segment the alignment into blocks, where each block 
has gaps in only one of the two sequences. The decod- 
ing algorithm considers each block as a unit, summing 
probabilities of all alignments that have the same block 
structure. Finally, Satija et al. [8] have demonstrated that 
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fixing a particular alignment is not necessary in some 
comparative genomics applications, instead one can con- 
sider all possible alignments weighted by their probability 
in the PHMM. 

In this paper, we concentrate on modeling sequence 
alignments in the presence of tandem repeats. Short tan- 
dem repeats cover more than 2% of the human genome, 
and occur in many genes and regulatory regions [9]; in 
fact, majority of recent short insertions in human are due 
to tandem duplication [10]. Evolution of tandem repeats is 
dominated by tandem segmental duplications resulting in 
regions composed of a highly variable number of almost 
exact copies of a short segment. Such sequences are diffi- 
cult to align with standard scoring schemes, because it is 
not clear which copies from the two organisms are orthol- 
ogous. Misalignments due to the presence of short tandem 
repeats are usually not limited to the repetitive sequence 
itself, but may spread into neighboring areas and impact 
the overall alignment quality (see Section Experiments). 

Sequence alignment with tandem duplication was first 
studied by Benson [11]. They propose an extension of 
the traditional Needleman-Wunsch algorithm that can 
accommodate tandem repeats in 0(n^) time. They also 
propose several faster heuristic algorithms. Additional 
work in this area concentrated on computing variants 
of edit distance either on whole sequences with tandem 
arrays or on two tandem arrays using different sets of 
evolutionary operations [12-14]. 

The first probabilistic approach to alignment of tandem 
duplications was introduced by Hickey and Blanchette 
[15], who developed a new probabilistic model by combin- 
ing PHMMs with tree adjoining grammars. Their model 
favors tandem duplications over other insertions, but 
the approach does not explicitly model whole arrays of 
tandemly repeated motifs. Moreover, algorithms to train 
and decode such models are relatively complex. 

Some protein families (such as zinc finger proteins) con- 
tain repetitive motifs similar in nature to tandem repeats 
in DNA. To align such proteins, Kovac et al. [16] com- 
bined profile HMMs (capturing the properties of the 
repeating motif) and PHMMs (modeling alignments) into 
a single scoring scheme that can be decoded by a newly 
proposed algorithm. However, their scoring scheme is no 
longer a probabilistic model and the method is focused on 
correctly aligning individual occurrences of a single motif 
rather than alignment of long sequences interspersed with 
multiple motifs. 

Here, we propose a simple tractable PHMM for 
sequence alignment with tandem repeats, and we explore 
various decoding methods for its use in sequence align- 
ment. In addition to the classical Viterbi decoding, we 
define several variants of the posterior decoding as 
well as block-based methods tailored to the specifics 
of our model. To demonstrate the differences, we have 



implemented several of these methods and compared 
their performance. 

Pair HMMs for alignment with tandem repeats 

Tandem repeats may arise by a complicated sequence of 
evolutionary events, including multiple rounds of tandem 
duplication, deletion, point mutation, gene conversion 
and other phenomena. Tandem repeat arrays at ortholo- 
gous locations in two related species may have arisen in 
the common ancestor and thus share part of their evo- 
lutionary history, but they could be further modified by 
independent events occurring after speciation. Models 
attempting to capture such diverse evolutionary mecha- 
nisms usually lead to complex problems in inference and 
parameter estimation. We propose two tractable mod- 
els, based on classical PHMMs, which still capture the 
essence of a tandem repeat array: periodically repeating 
motif, which may be shared between the two species, or 
be specific to one species only. 

A PHMM defines a probability distribution over align- 
ments of two sequences X and Y, The standard PHMM 
has three states (see Figure 1): match state M generat- 
ing ungapped columns of the alignment, and two insert 
states Ix and /y, where Ix generates alignment columns 
with a symbol from X aligned to a gap, and ly gener- 
ates columns with a symbol from Y aligned with a gap 
[1]. In our work, we will use a more complex PHMM, but 
standard algorithms for inference are still applicable. 

We call our main model SEE, and its details are 
shown in Eigure 2. The model contains a standard 
three-state PHMM and two "sunflower" submodels Ri^x 
and Ri^Y for each possible repeating motif /. Sub- 
model Ri^x generates several (possibly zero) copies of 
the motif in sequence X and submodel Ri^y gener- 
ates motif copies in sequence Y, Each copy of the 
motif is generated independently and the number of 
copies in X and Y are independent and geometrically 
distributed. 




Figure 1 The standard three-state PHMMs. The match state M 
generates ungapped columns, insert states Ix and // generate 
symbols aligned to a gap. 
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Each sunflower submodel is a circularized profile HMM 
emitting copies of the motif in one of the two sequences. 
For a motif of length p, the submodel contains p match 
states Mo, . . . each match state emitting one sym- 

bol of the motif. Insertion state Ij allows to insert addi- 
tional characters between symbols emitted by Mj and 
mod/?- Deletion states Dj and Dj allow to bypass 
match state Mj, and thus correspond to deletions with 
respect to the reference motif sequence. Since the sub- 
model can emit multiple tandem copies of the motif, 
the states in column — 1 are connected to the states 
in column 0. To avoid cycles consisting solely of silent 
states, we use two separate chains of deletion states. Chain 
Dq, . . . ,D^_2 can be entered only in state Dq, and the 
model can stay in this chain for at most p — 1 steps. Chain 
Di, . . . , Dp-i can be entered only after visiting at least one 
match or insert state in the current copy of the motif. As 
a result, the model can never pass around the whole cir- 
cle using delete states. Note that the model prefers integer 
number of repeats, even though partial repeat occur- 
rences are common in the real data. If desired, this can 
be addressed by simple changes in the model topology or 
parameters. 

The overall model can have sunflower submodels for an 
arbitrary number of motifs; we can even define an infi- 
nite model, in which every possible finite string serves as 
the motif for one pair of sunflowers. In our work, we use 
k = 310,091 motifs chosen as consensus strings of all 
tandem repeats found by the TRF program [17] run on 
the human chromosome 15 and its orthologous sequences 
in the dog genome. The probability of choosing a par- 
ticular motif out of all k possibilities can be uniform or 
dependent on the motif length or composition. We assign 



this probability based on the observed frequency of the 
corresponding consensus pattern in the TRF output. 

Likewise, we could use a multiple alignment of real 
motif occurrences to set individual parameters of the pro- 
file HMM. Instead, we use the same set of parameters for 
all states of all motif submodels. In particular, we set the 
insert and delete rates to 0.005; the match states allow 
mutations away from consensus according to the Jukes- 
Cantor model with parameter t = 0.05. Parameters of the 
three-state PHMM were estimated from the UCSC align- 
ment of the human chromosome 15 and its orthologous 
regions in the dog genome. 

Our model also assumes that individual copies of a fixed 
motif are independent. If they share part of their evolu- 
tionary history, this assumption is not valid, but it greatly 
simplifies the model. We could add some limited depen- 
dence by introducing repeat submodels emitting copies in 
the two sequences simultaneously; we have used such a 
model in a different setting in our previous work [16]. 

We have also explored a smaller model of tandem 
repeats based on an approach developed for repeat mask- 
ing by Frith [18]. This approach, called TANTAN, repre- 
sents all repeats with consensus of length k by a single 
state Ri(, To achieve this, state R/^ has emission table of 
order /c. According to a traditional definition [1], in a state 
with emission table of order /c, the probability of a partic- 
ular symbol at position / depends on symbols at positions 
/—/c, — 1. In TANTAN, however, the emission at posi- 
tion / depends only on the residue at position / — k in such 
a way that the two symbols k positions apart are equal 
with high probability. Thus a single state of order k can 
generate a tandem repeat with period /c, allowing for mis- 
matches between consecutive motif occurrences. Overall, 




Figure 2 SFF (sunflower field) model. A pair hidden Marl<ov model for alignment with tandem repeats. Each submodel /?,;« (left) is a circular 
profile hidden Markov model emitting tandem copies of the motif in one sequence. State Mj is the match state generatingyth symbol of the motif, 
state Ij allows insertions between symbols j and j + 1 of the motif, and states Dj and D| allow to skip state Mj. States s and e designate the entry and 
exit points from the submodel. The full SFF model (right) contains a standard three-state PHMM with states M, Ix and ly, and two submodels /?,;/, /?,;/ 
for each motif/. States and submodels with subscript / and Y generate symbols in the respective sequence X ofY only. 
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the TANTAN submodel for a single tandem repeat of 
unknown length up to K consists of states T^i, . . . , Rk con- 
nected to initial and final states, as shown in Figure 3. To 
allow insertions and deletions with respect to the motif 
consensus, TANTAN uses chains of insertion and deletion 
states connecting states of different order. 

We have slightly modified the TANTAN model for our 
purposes. TANTAN uses the repeat states starting from 
the second occurrence of the motif, because only at that 
point we start to see correlations with symbols k positions 
apart. To include the first repetition of the motif in the 
repeat submodel, we have added a prefix path of states 
Pif . . .,Pk such that P/^ is connected to R/^ (see Figure 3). 

To keep the number of parameters low, we use the same 
transition probability pr to move from any P/^ to R/^ for 
any k < K, States Pj^ and insert states generate symbols 
according to the background distribution. Repeat state R/^ 
assumes that the symbol at current position / has evolved 
from the symbol at position / — k according to Jukes- 
Cantor model with fixed time t. Finally, parameters pES> 
PERy and pei govern the probability of transition to end 
state from indel state, repeat state and init state, respec- 
tively, and ps, psE denote the probability of starting an 
indel and the probability of extending the indel. 

To build a full PHMM, we use two copies of the TAN- 
TAN submodel, each emitting tandem repeat in one of the 
sequences. These two copies are connected to the stan- 
dard three-state pair HMM representing non-repetitive 
sequence. We call this model TANTAN, although it differs 
from the model introduced by Frith [18]. 

Parameters of the TANTAN PHMM were set similarly 
as in the SFF model. Parameters pESy Per> Pei and t of the 



TANTAN submodels were estimated by the Baum-Welch 
algorithm [1] on 500 repeats sampled from SFF. 

The size of the SFF model is proportional to the sum 
of the lengths of all consensus sequences. In contrast, the 
size of the TANTAN model is proportional only to the 
length K of the longest represented motif. Thus SFF can 
be exponentially larger than TANTAN if it includes all 
possible consensus motifs of length up to IC Both SFF and 
TANTAN share the same overall structure, each consist- 
ing of the three-state pair HMM and a repeat submodel 
generating repeats separately in the two sequences. In 
the next section, we describe several inference methods 
appropriate for both of these models. 

Inference criteria and algorithms 

Given one of the PHMMs introduced in the previous 
section, and two sequences X = x\ . . .Xn and Y = 
yi . . .jmi we wish to find the alignment of these two 
sequences best agreeing with the model. We can also 
annotate this alignment by labeling individual alignment 
columns with additional information. We start by defining 
an alignment and its annotation more formally (see 
Figure 4). An alignment of X and 7 is a sequence of 
pairs iai, bi), . . . (at, bt), each pair representing one align- 
ment column. Symbol at represents either a position in 
X, or a gap annotated with the position of the nearest 
non-gap symbol on the left; formally at e {1, . . U 
{— 0, —1, . . — fz}. To specify a valid alignment, a\ must 
be 1 or —0, at must be n or — ^, and if at g {/, — y}, ai^i 
must be 7 + 1 or — y. The conditions on symbols bi repre- 
senting positions in sequence Y are analogous. The state 
annotation of an alignment is a sequence of states si. . .St 



(a) 




(b) 









Tx 




Ty 









(c) 

Figure 3 TANTAN-like repeat model for repeat motifs of length up to /C = 4. (a) Model by Frith [1 8] representing one tandem repeat witliout 
indels. States s and e are silent, state has emission table in which symbol at position / depends on symbol at position / - k. (b) Full model by Frith 
[1 8] allows insertions using states I],... Ik-] and deletions using silent states D],...,Dk. Our modification includes states P] ...Pk for the first 
occurrence of the motif, (c) Full pair HMM consists of two copies Tx and Ty of the model from part (b), one for each input sequence, and the 
traditional three-state pair HMM for non-repeat regions. 
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such that state 5/ generated alignment column (ai, bi). The 
repeat annotation is a binary sequence ri . . . r^, where 
Vi = 1 if the state Si generating the i-th. column is one 
of the states in the repeat submodels. While the state 
annotation can be used with any PHMM generating the 
alignment, the repeat annotation is appropriate only for 
PHMMs explicitly modeling repeats. 

We will explore several inference criteria for choosing 
the best alignment. To describe them, we will use the ter- 
minology of gain functions [19]; analogous notion of a 
loss functions is frequently used in statistics and machine 
learning. K gain function G{A,At) evaluates similarity 
between a predicted alignment A and the correct align- 
ment At) higher gain means that the prediction is of 
higher quality. Since the true alignment A 7 is not known, 
we will consider the expected gain G(A, Y] 
of alignment A, assuming that sequences X and Y were 
generated by our model 

EaAG{A\At)\X, Y] = Y, G{A,At) Pr(Ar|X, Y), 

At 

In each optimization criterion, we choose a particular 
gain function and look for alignment A* maximizing the 
expected gain A* = argmax^ G(A, A F]. Note 
that the gain function is only a way of defining the optimal 
solution; the corresponding decoding algorithm needs to 
be designed on a case-by-case basis. 

Decoding criteria for the three-state PHMM 

For simplicity, we start with criteria for the three-state 
PHMM, where the state annotation is uniquely deter- 
mined by the alignment itself. 

Viterbi decoding 

Perhaps the simplest gain function assigns gain +1 if the 
predicted alignment A is identical to the true alignment 
At> and 0 otherwise. To optimize this gain function, we 
need to find the alignment with the highest overall prob- 
ability in the model. In the simple three-state PHMM, 
this alignment can be found by the classical Viterbi algo- 
rithm in time 0(nmE), where E is the number of non-zero 
transitions in the model. 



Posterior decoding 

While the Viterbi decoding assigns gain only if the 
whole alignment is correctly predicted, posterior decod- 
ing assigns gain +1 for each correctly identified alignment 
column. Recall that the column is a pair (ai, bi), and it 
is considered correct, if the same column also occurs 
somewhere in the true alignment. The optimal align- 
ment under this gain function can be found by computing 
the posterior probability of each alignment column using 
the forward and backward algorithms for PHMMs, and 
then finding the alignment as a collection of compatible 
columns with the highest sum of posterior probabilities. 
A similar algorithm is considered for example by Lunter 
et al. [3], except that the column posteriors are multiplied 
rather than added. The running time of this algorithm is 
again 0(nmE). 

Marginalized posterior decoding 

Lunter et al. [3] also consider a variant of posterior 
decoding, where column —j) is considered correct and 
receives gain +1, if the true alignment contains column 
(/, —i) for any value of €. In other words, when symbol 
Xi is aligned to a gap, we do not distinguish where is the 
location of this gap with respect to sequence Y. Columns 
(— y, /) are treated symmetrically. To optimize this gain 
function, we again start by computing posteriors of all 
columns. Then we marginalize the probabilities of gap 
columns, effectively replacing posterior of column (/, —j) 
with the sum of posteriors of columns —i) for all €. As 
before, we then find the alignment maximizing the sum of 
posteriors of its columns. The algorithm runs in 0(nmE) 
time. 

Decoding criteria for the SFF Model 

In more complex models, including ours, one alignment 
can be generated by several different state paths. Gain 
functions can thus take into account also the state or 
repeat annotation of the alignment. 

Viterbi decoding 

In more complex models, the classical Viterbi algorithm 
optimizes a gain function in which the alignment is 




Figure 4 Example of an alignment represented in our notation, together with its state and repeat annotation. State r in the state sequence 
is a sliortliand for tine state M] witiiin submodel Rj^y of the SFF, wliere Rj^ represents motif A. 
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annotated with the state path generating it, and gain +1 is 
awarded only when both the aUgnment and the state path 
are completely correct. 

Posterior and marginalized posterior decoding 

We will consider a variant of the posterior decoding, in 
which alignment columns are annotated by the repeat 
annotation, and an alignment column gets a gain +1, if 
the true alignment contains the same column with the 
same label The only change in the algorithm is that the 
forward-backward algorithm produces posterior proba- 
bilities of columns annotated with the state, which are 
then marginalized over all states with the same repeat 
label. The running time is still 0{nmE), Similar modifi- 
cation can be done for marginaUzed posterior decoding, 
where we marginalize gap columns based on both state 
and gap position. 

Note that we might wish to completely marginalize over 
annotations and award gain only based on alignment. 
This method is however not appropriate for our models, 
because it would treat repeats as gaps, even if they have 
orthologs in the other sequence. 

Block decoding 

We will consider also a stricter gain function, which 
requires that repeat regions have correctly identified 
boundaries. We will split the alignment annotated with 
repeats into blocks^ so that each maximal region of con- 
secutive columns labeled as a repeat forms a block. Each 
column annotated as a non-repeat also forms a separate 
block. The gain function awards gain +1 for each non- 
gap symbol in every correctly predicted and labeled block. 
Correctness of non-repeat columns is defined as in the 
posterior decoding. A repeat block is considered correct, 
if exactly the same region in X and the same region in Y 
are also forming a repeat block in the true alignment. Note 
that the gain for each block is proportional to the num- 
ber of non-gap symbols in the block to avoid biasing the 
algorithm towards predicting many short blocks. 

To optimize this gain function, we first compute poste- 
rior probabilities for all blocks. Note that a block is given 
by a pair of intervals, one in X and one in Y, Therefore 
the number of blocks is OirP'nP'). The expected gain of a 
block is its posterior probability multiplied by the number 
of its non-gap symbols. After computing expected gains 
of individual blocks, we can find the highest scoring com- 
bination of blocks by dynamic programming in 0{rP'nP') 
time. 

To compute block posterior probabilities, we transform 
the SFF model to a generalized PHMM [20], in which all 
repeat states are replaced by a single generalized state R, 
In generalized HMMs, emission of a state in one step can 
be an arbitrary string, rather than a single character. In 
our case, the new state R generates a pair of sequences 



from the same distribution as defined by one pass through 
the repeat portion of the original SFF model. The pair 
of sequences generated by R represents one block of the 
resulting alignment. We call this new model the block 
model. Using the forward-backward algorithm for gener- 
alized HMMs, we can compute posterior probabilities of 
all blocks in 0(n^m^f) time where/ is the time necessary 
to compute emission probability for one particular block. 

If we naively compute each emission separately, we get 
f = 0(nmE). However, we can reduce this time for the 
SFF model as follows. If the SFF contains only one motif, 
the emission probability of sequences x and y in the R 
model is simply 

Pr (x,y \R) = Vr{x\ Ri,x) Pr {y \ Ri,y) , 

because the model first generates x in the sunflower sub- 
model Ri^x and then generates y in the model Note 
that these two models are connected by a transition with 
probability 1. In the general case, we sum the probabil- 
ities for all k motifs, each multiplied by the transition 
probability of entering that motif. To compute block emis- 
sion probabilities fast, we precompute Pr (x \ Ripe) and 
Pr (y I Rij) for all substrings x and y of sequences X and 
Y respectively. This can be done by the forward algorithm 
in 0{{rfl + nP')E) time. After this preprocessing, the com- 
putation of emission probability is 0(/:), and the overall 
running time of this algorithm is 0{krP'nP' + {rP' + nP')E). 

Block Viterbi decoding 

The final gain function we consider is a variant of the 
Viterbi decoding. The Viterbi decoding assigns gain +1 
for a completely correct alignment labeled with a correct 
state annotation. One alternative is to assign gain +1 if the 
alignment and its repeat annotation are completely cor- 
rect. This gain function considers as equivalent all state 
paths that have the same position of repeat boundaries 
but use different motifs or different alignments of the 
sequence to the motif profile HMM. 

In the SFF model, location of a repeat block uniquely 
specifies alignment within the block, because all sym- 
bols from sequence X must come first (aligned to gaps), 
followed by symbols from sequence Y, However, some 
models may emit repeat bases from the two sequences 
aligned to each other. We wish to abstract from exact 
details of repeat alignment, and consider different align- 
ments within a repeat as equivalent. Therefore, we will 
reformulate the gain function in terms of blocks. The 
alignment labeled with repeat annotation gets a gain 1, 
if all its blocks are correct, where block correctness is 
determined as in the block decoding. This formulation is 
similar to the one solved by Hudek [7]. 

To optimize this gain function, we use the Viterbi algo- 
rithm for generalized HMMs applied to the block model, 
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which leads to running time 0(kn^m^ + (n^ + m^)E), by 
similar reasoning as above. 

Practical considerations 

Even the fastest algorithms described above require 
OinmE) time, where sequence lengths n and m can be 
quite high when aligning long genomic regions and the 
size E of the SFF model depends on the sum of the lengths 
of all repeat motifs, which can be potentially even infinite. 
However, we can use several heuristic approaches to make 
the running time reasonable. 

First of all, we can use the standard technique of band- 
ing, where we restrict the alignment to some window 
around a guide alignment obtained by a faster algorithm. 
A simpler form of banding is to split the guide alignment 
to non-overlapping windows and realign each window 
separately. These techniques reduce the 0(nm) factor. In 
practical experiments we use Muscle [21] to create guide 
alignments and restrict the new alignment to be within 30 
base window from the guide alignment. 

To restrict the size of the model, we first find tandem 
repeats in X and Y independently by running the TRF pro- 
gram [17]. Then we include in the SFF model only those 
motifs which appear at least once in the TRF output. If 
we process only relatively short windows of the banded 
alignment, the size of the model will be quite small. Note 
however, that we keep the transition probabilities entering 
these models the same as they are in the full SFF model. If 
the TRF finds a consensus not included in the original SFF 
model, we add its two submodels with a small probability 
comparable to the rarest included motifs. 

These two heuristics sufficiently speed up algorithms 
running in 0(nmE) time. The block decoding and the 
block Viterbi decoding need to consider all possible 
blocks, which is prohibitive even within short alignment 
windows. Therefore, we allow repeat blocks only around 
intervals annotated as repeats by repeat annotators. Let 
us assume that we have a set of intervals Tx annotated as 
repeats in sequence X and a set of intervals Ty annotated 
as repeats in sequence Y, Note that intervals in each set 
are not necessarily disjoint. We restrict blocks by allow- 
ing the generalized repeat state R to generate the block 
of substrings x and y if each of these substrings is either 
empty or one of the intervals in Tx or Ty respectively 
has both its endpoints within 10 bases from the respective 
endpoints of x or j. Therefore, we try at most (100| T^l + 
n)(100\Ty\ + m) blocks. 

For running algorithms with the TANTAN model, it 
is not necessary to know consensus motifs beforehand, 
but we restrict possible repeats to the same intervals as 
with the SFF model for block decoding and block Viterbi 
decoding. 

To define repeat interval sets Tx and Ty, we first include 
all intervals found as tandem repeats by the TRF. However, 



the TRF searches for repeats in each sequence indepen- 
dently, which can cause problems. For example, a tandem 
repeat can be found in only one of the species, if it is 
less conserved in the other species or has there only one 
copy of the motif. The TRF can also find repeats in both 
sequences, but with slightly different consensus motifs, for 
example rotated by several bases. To solve these problems, 
we reannotate both sequences using a simple HMM which 
we call SRF (sunflower repeat finder). The SRF model for 
a motif consensus c consists of the sunflower submodel 
for motif c connected to a background state B, as shown in 
Figure 5. We use the Viterbi algorithm to annotate possi- 
ble locations of tandem repeats with consensus motif c in 
each of the two sequences X and Y and add these locations 
to the sets of intervals Tx and Ty, We repeat this process 
for every consensus c discovered by the TRF in the input 
sequences X and Y, In this way, if the TRF discovers a tan- 
dem repeat with consensus c in X, we have a better chance 
to discover appropriate orthologous region matching this 
consensus in Y and add its interval to Ty, 

The final consideration is that our PHMMs do not 
align tandem repeats at orthologous locations, even if 
they share a common evolutionary history. This might 
be impractical for further use. Therefore we postpro- 
cess the alignments by realigning all blocks annotated 
as repeats using the standard three-state PHMM. In 
this realignment, we also include gaps adjacent to these 
repeats. 

Experiments 

We have compared decoding methods described in the 
previous section and several baseline algorithms on sim- 
ulated data (see Table 1). The data set contained 200 
alignments of length at least 200 each generated from 
the SFF model (the same model parameters were used in 
the sampling and for the alignments). In generating the 
dataset, we required that each tandem repeat had at least 
three copies in both species; otherwise, we would obtain 
many regions that would be labeled as tandem repeats, 
but would in fact only have a single copy. The error rate 
(in the first column of the table) measures the fraction 
of true alignment columns that were not found by a par- 
ticular algorithm. It was measured only on the alignment 



Sc 

Figure 5 Sunflower Repeat Finder model. The model consists of a 
sunflower for a tandem repeat with a fixed consensus and a state 
generating non-repetitive part of the sequence. 
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columns that were generated from non-repeat states in the 
simulation, as the SFF model does not give any alignment 
in repeat regions. 

The methods based on the SFF model (the first block 
of the table) outperform the baseline method (the Viterbi 
algorithm on the three-state model), reducing the error 
rate by 15-30%. In general, the methods that score indi- 
vidual alignment columns are more accurate than the 
Viterbi-based methods, which is not surprising, because 
error rate as a measure of accuracy is closer to the gain 
function they optimize. 

The SFF-based algorithms use the tandem repeat motifs 
predicted by the TRF, as well as approximate repeat inter- 
vals (block-based methods). The TRF predictions are not 
exact and may contribute to the overall error rate. We 
attempted to quantify this effect by using the real tandem 
repeat motifs and real boundary positions instead of the 
TRF predictions (the second block of Table 1). We can 
see that the use of TRF predictions indeed leaves space 
for improvement, with the best performing algorithm 



reducing the error rate by 37% compared to the base- 
line. It would be interesting to explore other programs 
for detecting tandem repeats, such as mreps [22] or 
ATRhunter [23]. 

In addition to the baseline three-state model, we have 
compared our approach to other alignment programs 
Context [15] and Muscle [21]. The Context program, 
which uses context-sensitive indels, was trained on a sep- 
arate set of 200 alignments sampled from our model. 
Muscle was run with default parameters. Both programs 
produced alignments with higher error rate than the base- 
line three-state model. In case of Context, the high error 
rate may be due to insufficient training data or software 
issues. 

A traditional approach to the problem of aligning 
sequences with repeats is to first mask repeats in each 
sequence independently and then perform sequence 
alignment. To compare to this approach, we have masked 
tandem repeats using TRF and then used the three-state 
model with posterior decoding. However, this introduced 



Table 1 Accuracy of several decoding methods on simulated data 





Alignment 


Repeat 




Block 




Algorithm 


error 


sn. 


sp. 


sn. 


sp. 


SFF marginalized 


3.37% 


95.97% 


97.78% 


43.07% 


44.87% 


SFF posterior 


3.53% 


95.86% 


97.87% 


42.70% 


47.37% 


SFF block 


3.51% 


93.09% 


98.07% 


36.50% 


41.67% 


SFF block Viterbi 


3.91% 


93.26% 


97.96% 


35.77% 


40.66% 


SFF Viterbi 


4.04% 


95.29% 


97.85% 


42.70% 


48.95% 


TANTAN block 


5.05% 


61.38% 


97.48% 


0.00% 


0.00% 


TANTAN block Viterbi 


6.17% 


67.86% 


96.51% 


0.00% 


0.00% 


SFF marginalized* 


3.02% 


98.93% 


99.64% 


77.01% 


76.17% 


SFF posterior* 


3.42% 


98.84% 


99.51% 


75.91% 


80.93% 


SFF block** 


3.21% 


97.70% 


99.87% 


80.66% 


94.44% 


SFF block Viterbi** 


3.71% 


98.12% 


99.85% 


81.75% 


92.18% 


SFF Viterbi* 


3.94% 


98.54% 


99.45% 


75.55% 


83.47% 


TANTAN block"*" 


3.42% 


60.45% 


99.90% 


0.36% 


0.46% 


TANTAN block Viterbi"*" 


3.83% 


61.74% 


99.88% 


0.00% 


0.00% 


Context 


5.98% 










Muscle 


5.62% 










3-state posterior 


4.41% 










3-state posterior with masked repeats"*""*" 


5.03% 


99.23% 


74.16% 


7.66% 


7.24% 


3-state Viterbi (baseline) 


4.78% 










SFF marginalized^ 


3.63% 


96.03% 


97.74% 


42.70% 


43.33% 


SFF marginalized^^ 


3.36% 


95.99% 


97.81% 


40.88% 


43.08% 



*: method uses the real consensus motifs. 

**: method uses the real consensus motifs and intervals from the real repeat blocks. 

method uses Intervals from the real repeat blocks. 

Columns with at least one masked character are considered as repeats. 
^: Parameters for the three-state submodel were estimated from human-chicken alignment. 
Parameters for SFF submodel were perturbed randomly. 
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16 




^0 5 10 15 20 

distance from repeat 



Figure 6 Alignment error rate of tiiree decoding methods as a 
function of the distance from the nearest repeat. 



even more errors than running the same model on the 
unmasked sequences, ignoring the repeat issue altogether. 
One could object to our tests, since we have used the 
same model for generating data and for testing. There- 
fore, we were interested in how robust is our approach 
to changes in model parameters. In the first experi- 
ment, we estimated parameters of the 3-state submodel of 
the SFF model from human-chicken alignments (human 
chromosome 20) instead of human-dog alignments. In 
the second experiment, we randomly perturbed param- 
eters of the sunflower submodel (each parameter was 
changed randomly by an additive term ranging from 
0.02 to 0.05). In both cases, the results have changed 
only slightly (Table 1), showing that our model is quite 
robust. 

The decoding methods that use the SFF model produce 
an alignment and a repeat annotation. Comparing anno- 
tation of each base in both sequences with the true repeat 
annotation sampled from the model (table columns repeat 
sensitivity and specificity), we note that the marginal- 
ized posterior decoding is the most sensitive, and the 
block decoding the most specific method. Specificity was 
quite high for all methods, low sensitivity for block-based 



methods was probably caused by wrong repeat intervals 
predicted by the TRF, since it improves markedly by using 
correct intervals. 

We have also compared the accuracy of predicting 
repeat block boundaries (table columns block sensitiv- 
ity and specificity). The number of blocks with correctly 
predicted boundaries is quite low, most likely because 
usually there are many high-probability alternatives with 
slightly shifted boundaries. However, even though more 
than half of the repeat blocks have some error in the 
boundary placement, the SFF-based methods improve the 
alignment accuracy most markedly close to repeat bound- 
aries, as shown in Figure 6. This is expected, because far 
from repeats, the model works similarly to the three-state 
PHMM. 

Since the alignments were sampled from the SFF model, 
it is expected that the TANTAN model will be less accu- 
rate than the SFF model on this data. Somewhat surpris- 
ingly, TANTAN-based methods have even higher error 
rate than the baseline which uses a much simpler model. 
The TANTAN also had a significantly lower repeat sensi- 
tivity than the SFF and did not predict almost any block 
correctly. We believe that this is due to the inaccurate 
modeling of the first copy of the repeat, since annotations 
of TANTAN methods almost always skip the first copy of 
the repeat motif in both sequences. Merging prefix states 
Pi, ... ^Pk in both copies of the TANTAN submodel into 
one chain of match states might improve the annotation 
of the first copy of the motif and thus possibly also the 
overall error rate. 

To illustrate the feasibility of running our methods on 
real genomic data, we attempted to improve the quality 
of the alignment of the human chromosome 21 to the 
dog genome. In particular, we have downloaded mam- 
malian alignments from the UCSC genome browser [24] 
and extracted the pairwise alignment of human (assem- 
bly hgl9) and dog (assembly canFam2) from the multiple 
alignment for the human chromosome 21. In this way, 
we have obtained 20. 8M alignment columns. We used the 
TRF to annotate tandem repeats in both species. Then we 
selected short windows covering annotated repeats and 
their surrounding regions (average window length was 



Table 2 Evaluation of realignments of human-dog alignments (human chromosome 21 ) 




Clean 








Detected errors 






Algorithm 


transcripts 


Start 


Stop 


5'splice 


3'splice 


Nonsense 


Frameshifts 


Out of (dataset size) 


2207 


2790 


2180 


18898 


18898 






3-state Viterbi 


730 


292 


459 


236 


401 


174 


309 


UCSC alignments 


743 


294 


427 


239 


415 


173 


348 


SFF posterior 


750 


288 


425 


251 


384 


174 


358 


SFF marginalized 


682 


292 


425 


251 


384 


174 


438 


SFF block 


714 


292 


432 


254 


388 


169 


392 
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3-state Viterbi *** 

GTACCCGCGGCGGAGCGAGGAGGCGAGAATGGATCA— ATGGTGCCACGGAGCACATCGCTGGCG 
GTGCCCCTAGCAAAGCGATGAGGCTGCAACAGACCAGTGTGAGGAGGGGGAGTAAATCATTGATG 

SFF posterior *** 

gtacccgcggcggagcgaggaggcgAGAATG GATCAATGGTGCCAC GGAGCACATCGCTGGCG 

AGGCTGTGCCCCTAGCAAAGCGATGAGGCTGCAACAGACCAGTGTGAGGAGGGGGAGTAAATCATTGATG 

Figure 7 Repeat-aware alignment may result in a significant change compared to the standard alignment methods. Lowercase letters 
show a tandem repeat annotated by the SFF upstream of RefSeq gene NM_1 53681 . While the first alignment aligned the start codon ATG in human 
with GTG in dog, the SFF posterior aligned it to a potential ATG start codon. 



109 bases) and realigned each window by several of the 
methods described above. 

Since we do not know the ground truth for these 
alignments, only indirect evaluation is possible. We have 
downloaded RefSeq, ENSEMBL, VEGA and UCSC gene 
annotation catalogs from the UCSC genome browser and 
selected 2207 (possibly overlapping) transcripts that cor- 
rectly matched the human reference (correct start codons, 
stop codons, splice sites, no in-frame stop codons). We 
have used our alignments to remap each transcript into 
the dog genome, and evaluated resulting transcripts for 
correctness in the dog genome using the clean_genes tool 
from the PHAST package [25]. 

Table 2 shows success and error rates of variety of indi- 
cators of the resulting dog transcript annotation. The best 
performing SFF posterior approach was able to remap 750 
transcripts from the human to the dog genome (compared 
to 730 transcripts using the baseline three-state Viterbi 
approach and 743 transcripts in the original UCSC align- 
ments). The SFF-based models tended to decrease the 
error rate around 3' splice sites; on the other hand, they 
increase the error rate in some other parameters (e.g. 
5' splice sites or frame-shifts). Many of these differences 
are due to small shifts in gap locations, but occasionally 
we see a substantial alignment change, as illustrated in 
Figure 7. However, problems in mapped transcripts do not 
always indicate an alignment error, because gene and exon 
boundaries may change in evolution due to indels or sub- 
stitutions, particularly near 3' ends of genes. Also note 
that the UCSC alignments are extracted from multiple 
alignments of many species, thus incorporating additional 
information that was not available to our alignments 
based on the SFF model. 

Conclusions 

We have designed two new pair hidden Markov models 
for aligning sequences with tandem repeats and explored 
a variety of decoding optimization criteria for their use. 
The new SFF model coupled with appropriate decod- 
ing algorithm reduces the error rate on simulated data, 
especially around boundaries of tandem repeats. With 
suitable heuristics, our approach can be used to realign 
long genomic regions. 



Our experiments are the first study comparing a variety 
of different decoding criteria for PHMMs. Our criteria for 
the SFF model optimize both the alignment and the repeat 
annotation. Depending on the application, one or the 
other may be of greater interest, and thus one may want to 
marginalize over all alignments and optimize the annota- 
tion, as in [8], or marginalize over labels and optimize the 
alignment. 

Our model does not take into the account the depen- 
dencies between the repeat occurrences in the two 
species. A tractable model allowing such dependencies 
would be of great interest. Previously, we have explored 
the problem of aligning two sequences simultaneously to 
a profile HMM, but we were not able to design a simple 
generative model for this purpose [16]. 
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